Typicality Defended 



Don N. Page

Institute for Theoretical Physics 
Department of Physics, University of Alberta 

Room 238 CEB, 11322 - 89 Avenue 
Edmonton, Alberta, Canada T6G 2G7, and
Asia Pacific Center for Theoretical Physics, Pohang 790-784, Korea 
(Dated: 2007 July 27) 

Hartle and Srednicki have argued that there is no observational evidence favoring our typicality. 
Here it is shown that such evidence does arise from including the normalization principle requirement 
that the sum of the likelihoods for all possible observations is normalized to unity in each theory.

I. INTRODUCTION

As Hartle and Srednicki [1] correctly note, "An increasingly
common kind of reasoning in fundamental cosmology
starts from an assumption that some property of human
observers is typical in some class C of objects in the
universe," for example in [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
which they cite later. They go on to claim [1] that "it
is perfectly possible (and not necessarily unlikely) for us 
to live in a universe in which we are not typical." While 
I agree that it is perfectly possible, in this paper I shall 
argue that it would be unlikely when one properly nor- 
malizes the likelihoods. 

That is, I shall argue that within each possible theory 
of the universe, the likelihood would be small that we 
are atypical, though one could assign a sufficiently high 
prior probability to a theory in which we are atypical to 
overcome this small likelihood. That is, the theory might 
have such high a priori probability that after a Bayesian 
analysis it could have the highest a posteriori probability 
even though it makes us atypical and unlikely. However, 
purely from the likelihoods, properly normalized, typi- 
cality is favored, contrary to what Hartle and Srednicki 
conclude when they do not require that the likelihoods
for all possible observations sum to unity.

A key issue to be discussed below is how likelihoods are 
to be defined by a theory, which seems to lie at the heart 
of Hartle and Srednicki's disagreement with calculations 
favoring typicality. Another key issue is the gap between 
the first sentence of their conclusion (v), "We have data 
that we exist in the universe, but we have no evidence 
that we have been selected by some random process," 
with which I agree, and the second sentence of that con- 
clusion, "We should not calculate as though we were," 
with which I disagree. 

Before getting into these points of disagreement, it may 
be helpful to list points of agreement. 

I do agree with Hartle and Srednicki's conclusion (i), 
"A theory is not incorrect merely because it predicts that 
we are atypical." Low typicality merely implies low likelihood,
but one must also consider the prior probability
assigned to the theory. 

I also agree with conclusion (iii), "No part of our data 
should be neglected in the process of discriminating be- 
tween competing theories unless it can be demonstrated 
that the relevant probabilities are insensitive to it." 

Strictly speaking, I also agree with conclusion (vi), 
whose first sentence is, "In a fundamental theory of quan- 
tum cosmology, there is no need for any assumption of 
typicality to predict what we might see." I would say 
that all we need to do is consider theories that predict 
correctly normalized likelihoods for all possible observa- 
tions, and then the typicality will be automatically re- 
flected in these likelihoods. 

I furthermore strongly agree with Hartle and Srednicki
in using Bayesian probability theory [13, 14, 15], with its
prior probability $P(T_i)$ for each theory $T_i$, its likelihoods
$P(D_j|T_i)$ or conditional probabilities for each possible
data set $D_j$ given the theory $T_i$, and its posterior
probabilities $P(T_i|D_j)$ or reverse conditional probabilities for
the theories $T_i$ given the particular data set $D_j$ that is
obtained. These posterior probabilities are given in terms
of the prior probabilities and the likelihoods by Bayes'
theorem,

$$P(T_i|D_j) = \frac{P(D_j|T_i)\,P(T_i)}{\sum_k P(D_j|T_k)\,P(T_k)}. \qquad (1.1)$$



In particular, I concur with Hartle and Srednicki that 
Bayesian analysis provides a framework for distinguishing 
"facts, logical deduction, and prejudices." As they nicely 
express it, "Data are the domain of facts, likelihoods are 
the domain of logical deduction, and the priors are the 
domain of theoretical prejudice." 
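
As a minimal sketch of how Bayes' theorem combines the priors with the likelihoods for a finite list of theories, one might write the following. The function name `posterior` and the numbers in the example are illustrative assumptions, not values taken from any of the theories discussed here.

```python
def posterior(priors, likelihoods):
    """Posterior P(T_i | D_j) for each theory T_i via Bayes' theorem.

    priors[i]      -- P(T_i), the prior probability of theory i
    likelihoods[i] -- P(D_j | T_i), the likelihood of the observed data set
    """
    weights = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(weights)
    if total == 0:
        raise ValueError("observed data has zero probability under every theory")
    return [w / total for w in weights]


# Hypothetical numbers: a theory with a high prior but a low likelihood
# for our data, against a theory with a low prior that fits the data better.
example = posterior(priors=[0.9, 0.1], likelihoods=[0.001, 0.5])
```

With these assumed numbers the second theory ends up with the larger posterior, which illustrates that neither the prior nor the likelihood alone decides the comparison.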

In this Bayesian approach, my key difference from Har- 
tle and Srednicki is that I propose that one follow the 
normalization principle: One should only consider theo- 
ries that each give likelihoods summing to unity for all 
possible data sets,

$$\sum_j P(D_j|T_i) = 1. \qquad (1.2)$$

I shall take an atypical observation to be an observed
data set that has an anomalously low likelihood, so that 
atypical observations are unlikely, giving small weights in
Bayes' theorem. (I think of observations as being more
fundamental than observers and hence shall focus on typ- 
ical or atypical observations rather than on typical or 
atypical observers, but one could define an atypical ob- 
server as one who makes atypical observations.) 

Here I am not using the technical definition of typicality
I have proposed in [16, 17, 18], which after email
discussions with Srednicki [19] I realize has some problems
that I shall discuss elsewhere [20], but the looser
idea that atypical observations are those that would be 
unlikely to be chosen in a random selection from all ob- 
servations predicted by the theory. Although a precise 
definition is not needed here, it might help to have the 
following definition in mind as an example: 

Consider all the observed data sets $D_j$ predicted by
some theory $T_i$ and rank them in decreasing order of
their normalized likelihoods $P(D_j|T_i)$. Define the median
observation $D_m$ as the one with the smallest value of $m$ in
this ordered sequence such that $\sum_{j \le m} P(D_j|T_i) \ge 1/2$.
Then one might define the typicality of any observed data
set $D_j$ in this theory as being $t_j = P(D_j|T_i)/P(D_m|T_i)$.
For $j \le m$, the typicality is large, $t_j \ge 1$, so that at
least half of the likelihood occurs for observed data sets
with high typicality. Atypical observations would correspond
to low typicality, $t_j \ll 1$, and would occur only for
$j > m$ and a small fraction of the total amount of likelihood.
That is, it is unlikely that an observation would be
atypical if it were selected randomly with a probability
given by its normalized likelihood.
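
The example definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `typicalities` and the four likelihood values are hypothetical, and the data sets are taken to be finite in number.

```python
def typicalities(likelihoods):
    """Typicality t_j = P(D_j|T) / P(D_m|T) for each data set D_j.

    `likelihoods` holds the normalized likelihoods P(D_j|T) of one theory T;
    D_m is the median observation in the ranking by decreasing likelihood.
    """
    if abs(sum(likelihoods) - 1.0) > 1e-9:
        raise ValueError("likelihoods must sum to unity")
    cumulative = 0.0
    for value in sorted(likelihoods, reverse=True):
        cumulative += value
        if cumulative >= 0.5:
            median_likelihood = value  # P(D_m|T)
            break
    return [p / median_likelihood for p in likelihoods]


# Hypothetical likelihoods for four possible data sets.
t = typicalities([0.4, 0.3, 0.2, 0.1])
```

Here the second data set is the median observation, so it has typicality exactly 1, and the two least likely data sets have typicality below 1.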



II. DATA 

There are different ideas for what should constitute a 
data set $D_j$ to be used in a Bayesian probability analysis.
Since each theory $T_i$ is supposed to assign a likelihood
$P(D_j|T_i)$ to each data set $D_j$, it should be the theory
that defines the possible data sets. Thus different ideas 
of what the data sets are may be considered as differences 
in the theories. However, to compare different theories by 
a Bayesian analysis, they should all have data sets that 
are members of some single encompassing set of data sets, 
say S. Then for any theory whose data sets form only a 
proper subset of S, one can simply say that that theory 
predicts zero likelihood for all other data sets of S. In 
this way we can say that each theory $T_i$ assigns a unique
value to the likelihood $P(D_j|T_i)$ for each data set $D_j$ in
the full set S of such data sets. 

Although my argument does not depend on which full 
set of data sets $S$ is chosen (so long as it is precisely
defined), let me give some possible choices. The one that
seems the most fundamental to me is the set, say $S_1$, of all
possible conscious perceptions.
Roughly, each individual conscious perception is all that
a conscious observer is aware of at once, what Bostrom
[25] calls an observer-moment. If this conscious perception
is regarded as a data set, the data would be the
content of that awareness. In this $S_1$, each different possible
conscious perception would be a member, and any
two perceptions with different contents would be different 
data sets. 

Hartle and Srednicki use an HSI, a human scientific 
IGUS (information gathering and utilizing system), with 
the data set including "every scrap of information that 
the HSI possesses about the physical universe: every 
record of every experiment, every astronomical observa- 
tion of distant galaxies, every available description of ev- 
ery leaf, etc., and necessarily every piece of information 
about the HSI itself, its members, and its history." Al- 
though they consider only the data set D that our partic- 
ular HSI has (and thereby avoid the issue of normalizing 
the likelihoods over all possible data sets), one can certainly
consider all such data sets, say forming the set $S_2$.
Every such data set within $S_2$ would differ if it had different
scraps of information or different information within
at least one of the scraps. 

Another possible set, say $S_3$, of data sets would be
the set of all complete physical descriptions of all planets
(including what is on them, of course). Each physically
different planet would give a different data set in this set
of data sets. 

Yet another possible set of data sets, say $S_4$, would be
the set of all complete descriptions of the causal past of 
any event of spacetime (assuming for this that spacetime 
has a definite causal structure, which is not likely to be 
true in quantum gravity). 

For any particular Bayesian analysis, one should have 
a definite set S of well-defined possible alternative data 
sets. There should be some parallelism between the data 
sets within S, so it would not appear to be a good idea, 
for example, to use an $S$ that is the union of the set $S_1$
of all possible conscious perceptions and the set $S_2$ of all
possible HSI data sets, since each HSI data set may con- 
tain one or more (usually many) conscious perceptions. 

Because the set S of data sets must be well defined, 
with distinct members (the data sets themselves), the 
data sets within $S$ must all be different and can be regarded
as different alternatives, different possible observations.
By assumption, an observation (whether a conscious
perception, all the data of an HSI, or all the data
of a planet) is of a distinct data set $D_j$ within $S$, and
therefore each theory $T_i$ should assign a definite likelihood
$P(D_j|T_i)$ for each data set $D_j$. Since the data sets
are alternatives of what might be observed, and since by
assumption some particular data set actually is observed,
for each theory $T_i$ the likelihoods (the conditional
probabilities of the data sets, given the theory)
should sum to unity, the normalization condition (1.2).



III. LIKELIHOODS 

Each normalized likelihood $P(D_j|T_i)$ that I have discussed
above may be regarded as the probability, conditional
upon the theory $T_i$, that an observation (randomly
chosen without restricting the data) gives the data set
$D_j$. If one considers the observation to be made by an
observer (whether a single conscious being at one time, a 
human scientific information gathering and utilizing sys- 
tem, an entire planet, or an entire region of spacetime), 
it is a 'first-person' observation, a distinct alternative to 
any other first-person observation of a different data set. 
Since as first-person observations, the different data sets 
are mutually exclusive (the first-person at one time ob- 
serves only one data set), their probabilities should add 
up to one, the normalization principle expressed by Eq. (1.2).

Hartle and Srednicki implicitly discard the first-person
nature of the observation. Instead of using the full first- 
person knowledge, "We observe the data set D," they 
consider only the reduced third-person knowledge and 
say, "All we know is that there exists at least one such 
region containing our data." Therefore, instead of calculating
the likelihood of our data $D$ as the normalized
likelihood of this data set out of all other possible first- 
person data sets, they effectively calculate the (different) 
probability that this data set exists in at least one region. 
That is, they consider not all possible data sets $D_j$ in $S$ as
the alternatives, but simply the binary alternatives that
our particular data set $D$ occurs somewhere and that $D$
does not occur anywhere [19].

These two procedures, theirs and mine, are equivalent 
in usual laboratory experiments in which only one data 
set actually occurs (in a single branch of the Everett 
many-worlds wavefunction). Then if any data set $D_j$
occurs that is different from $D$, $D$ does not also occur, so
the probability that $D$ does not occur is the sum of the
probabilities for all $D_j$ different from $D$. But in a large
enough universe, many different data sets can all actu- 
ally occur. Then the probability that D does not occur 
can be much smaller than the sum of the probabilities 
for each of the other data sets to occur. 

The third-person existence of two different data sets, 
$D_j$ and $D_k$ with $j \neq k$, is not mutually exclusive or inconsistent,
but the first-person observation of two different
data sets is mutually exclusive. Therefore, the third- 
person existence probabilities for the data sets can be 
different from the first-person observational probabilities 
for these same data sets. 

Ordinary quantum theory with a complete orthogonal
set of projection operators, or the consistent histories
approach with a decoherent set of class operators,
is well-suited for calculating the third-person
existence probabilities. But if one wants the first-person
observational probabilities, one needs something
more. 

For example, consider a toy model for $S$ that consists of
only two possible data sets, $D_1$ and $D_2$. Suppose that the
quantum state is such that with unit probability, there
exists 1 region with an observer observing the data set
$D_1$, and 999 regions that each have an observer observing
the data set $D_2$. Ordinary quantum theory would give
the third-person existence probability of both $D_1$ and $D_2$
as unity. Since these two existence possibilities are not
mutually exclusive, their existence probabilities do not
add up to 1 but rather to 2. On the other hand, it would
be quite reasonable to assume that the first-person observational
probabilities are the same for all 1000 regions,
so that the normalized probability of $D_1$ is 0.001 and of
$D_2$ is 0.999. That is, there are 999 times as many regions
with $D_2$ as there are with $D_1$ (and we assume that
there is nothing else of importance, other than these different
data sets, distinguishing the observers in the two
regions, so all of them can be considered to have equal
weight), so the probability of observing $D_2$ is 999 times
the probability of observing $D_1$.
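
The arithmetic of this toy model can be checked with a short sketch. The dictionary names below are illustrative, and the equal weighting of all regions is the assumption stated in the text, not a result.

```python
from fractions import Fraction

# Toy model from the text: number of regions in which each data set is observed.
regions = {"D1": 1, "D2": 999}

# Third-person existence probability: each data set occurs somewhere with certainty.
existence = {d: Fraction(1 if n > 0 else 0) for d, n in regions.items()}

# First-person observational probability, assuming every region has equal weight.
total = sum(regions.values())
observation = {d: Fraction(n, total) for d, n in regions.items()}
```

The existence probabilities sum to 2, whereas the observational probabilities sum to 1, with 1/1000 for the first data set and 999/1000 for the second.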

If the data sets are conscious perceptions, then one 
way of getting normalized probabilities for each of them 
is by the framework of Sensible Quantum Mechanics
or Mindless Sensationalism,
which in the discrete normalizable case assigns a probability
to each conscious perception that is the expectation
value of a corresponding positive 'awareness operator.'
There is no requirement that these positive operators be
orthogonal to each other or even be proportional to projection
operators (though they might be approximately
proportional to the integral over all of spacetime of projection
operators in local regions). In the example above,
assuming that the operators corresponding to $D_1$ (for the
first region) and to $D_2$ (for the remaining 999 regions)
receive the same contribution to the expectation value
from each region, then since there are 999 times as many
regions giving $D_2$, the corresponding awareness operator
would have 999 times the probability of that for $D_1$,
leading to the same probabilities as in the previous paragraph.

Hartle and Srednicki [1] object that calculations like
this one "make the selection fallacy that we are randomly 
chosen from a class of objects by some physical process, 
despite the absence of any evidence for such a process," 
further stating in a particular example, "In fact, there 
has been no selection at all." As mentioned in the Introduction,
I do agree with them that "We have data that
we exist in the universe, but we have no evidence that we 
have been selected by some random process," but I dis- 
agree with their conclusion that "We should not calculate 
as though we were." If the universe does have many ob- 
servers, there is indeed no physical selection within them 
of which exist and which do not, since they all exist in 
the third-person sense. However, to interpret one's first- 
person experience, it is perfectly legitimate to calculate 
as if it were randomly selected from the set of all obser- 
vations. 

Bostrom has cogently argued in [25], p. 162, for the
Strong Self-Sampling Assumption (SSSA): "One should 
reason as if one's present observer-moment were a ran- 
dom sample from the set of all observer-moments in its 
reference class." This is similar to how I might today
state my Conditional Aesthemic Principle (CAP)
[16]: "Unless one has compelling contrary evidence, one
should reason as if one's conscious perception were a random
sample from the set of all conscious perceptions." I
would argue that the reference class of all observer-moments
(which I would call conscious perceptions, each
being all that one is consciously aware of at once) should 
be the universal class of all observer-moments. 

Comments analogous to that about the "selection fal- 
lacy" may be made about the different branches of the 
wavefunction in the many-worlds interpretation of quan- 
tum theory, in which the wavefunction never collapses. 
All of the branches with nonzero amplitude may be con- 
sidered actually to occur, with no real physical selection 
between them, but for an observer predicting what he 
may observe in the future, it may be legitimate to make 
what might be called the Copenhagen fallacy and calcu- 
late as if there were probabilities for the various branches 
of the wavefunction to be selected, say by a postulated 
collapse of the wavefunction. Just as in this many-worlds 
case where it may be legitimate to reason as if there are 
probabilities for the selection of a particular branch of the 
wavefunction (even if in fact there is no such selection), 
so in the many-observations case it may be legitimate to 
reason as if there are probabilities for the selection of a 
particular observation. 

Let me make the parallel between the Copenhagen fal- 
lacy and the selection fallacy more explicit: 

Collapse of the wavefunction is false (the "Copenhagen 
fallacy"). But we can calculate likelihoods as if it hap- 
pens and use them in a Bayesian analysis to get posterior 
probabilities for theories. 

Selection of observers is false (the "selection fallacy"). 
But we can calculate likelihoods as if it happens and use 
them in a Bayesian analysis to get posterior probabilities 
for theories. 



IV. CONSEQUENCES 

When the full first-person information about an obser- 
vation or observed data set is taken into account ("We 
observe D"), and not just the third-person account ("D 
exists"), then we can consider all the different data sets 
to be mutually exclusive and hence have normalized like- 
lihoods. It is then natural to have likelihoods that vary 
monotonically with the typicality assigned to the obser- 
vation, so that less typical observations have lower like- 
lihood. More simply, atypical observations are unlikely. 

The requirement that the likelihoods be normalized 
means that it does matter what other observations are 
possible in a theory, besides what we may actually ob- 
serve. A theory that predicts a huge number of other pos- 
sible observations of significant relative likelihood, say by 
Boltzmann brains, would tend to
give a lower likelihood for our observation than a theory 
that does not. This contradicts the second sentence of 
Hartle and Srednicki's conclusion (ii): "What other ob- 
servers might see, how many of them there are, and what 
properties they do or do not share with us are irrelevant 
for this process." 

Requiring the likelihoods of observations to be normalized
first-person probabilities, instead of the third-person
existence probabilities, also releases theories from
the enormous limitations of the second sentence of Har- 
tle and Srednicki's conclusion (iv): "Cosmological models 
that predict that at least one instance of our data exists 
(with probability one) somewhere in spacetime are indis- 
tinguishable no matter how many other exact copies of 
these data exist." If one were forced to abide by that 
limitation, then a huge variety of cosmological models 
with a sufficiently large universe (spatially noncompact 
cosmologies, and also spatially compact cosmologies with 
enough inflation) would give nearly unit probability for 
our data set and hence the same likelihoods. Thus ob- 
servations would count for nothing in distinguishing be- 
tween these theories, and much of cosmology would cease 
to be an observational science. 
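
To illustrate how existence probabilities saturate, here is a minimal sketch under a simplifying assumption that is mine rather than part of any specific cosmological model: the universe contains n independent regions, each of which contains our data set with some small probability p. The function name and the two hypothetical theories are illustrative only.

```python
import math

def existence_probability(p, n):
    """Probability that data set D occurs in at least one of n independent
    regions, each of which contains D with probability p."""
    # 1 - (1 - p)**n, computed stably for tiny p and huge n.
    return -math.expm1(n * math.log1p(-p))

# Two hypothetical theories whose per-region probabilities differ by a
# factor of 1000, in a universe with an assumed 10**30 regions.
n_regions = 10**30
theory_a = existence_probability(1e-20, n_regions)
theory_b = existence_probability(1e-23, n_regions)
```

Under these assumptions both theories give an existence probability indistinguishable from unity, even though they differ greatly in how often our data set occurs per region, so the existence probability alone could not be used to tell them apart.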

Carter [30] has noted that the assumptions of Hartle
and Srednicki, considering only our data and not what
other observers might see, are an example of what, in comparison
with the anthropic principle of assuming that we
are typical until it is shown otherwise [30], he has labeled
[31] "the more sterile and restrictive autocentric
principle." Since Hartle and Srednicki's arguments imply
that one could not distinguish observationally any
cosmological theory that gives unit (or even any other 
equal) likelihood to our observed data, it certainly seems 
better to choose other principles that lead to varying like- 
lihoods and hence the possibilities of testing cosmological 
theories observationally. 

If Hartle and Srednicki's assumptions were adopted, 
then theories with a sufficiently vast and varied multi- 
verse to predict the existence of our data with near cer- 
tainty would all have the same weight in a Bayesian anal- 
ysis, greater than that of any theory that predicted the 
existence of our data with significantly less than unit like- 
lihood. This would seem to give an unfair advantage to 
multiverse theories. It would also make them subject to 
the criticism that they explain everything (since a huge 
variety of data sets would then have nearly unit existence 
likelihood) and thereby explain nothing. This seems far 
too cheap a solution to the goal of science to explain our 
observations. 

A suitable multiverse theory might turn out to be the 
best explanation of our observations, but it should have 
to earn that status by its high prior probability (from 
such considerations as being "simple, beautiful, precisely 
formulable mathematically, economical in their assump- 
tions, comprehensive, unifying, explanatory, accessible 
to existing intuition, etc. etc.," as Hartle and Srednicki 
nicely put it) and by not too low a nontrivial value it 
gives for the likelihood of our observed data set. Re- 
placing the normalized first-person observational proba- 
bilities with the third-person existence probabilities is a 
cheat, like putting the theory on steroids. 

In a Bayesian analysis to try to find a theory with the 
maximum a posteriori probability, it seems unlikely that 
one can avoid the tension between trying to make the 
a priori probability high and trying to make the likelihood
of our observations also high. For me, the simplest
theory, with the highest a priori probability, would be 
the theory that nothing exists. However, the likelihood 
of our observations would then be zero, so this theory 
is ruled out observationally. (It would also run into the 
problem of not obeying the normalization principle, since 
it would give no nonzero observational likelihoods at all 
to normalize.) The theory that everything existed would 
seem to me to be the next simplest theory and hence have 
the next highest a priori probability. If one then used 
existence probabilities (conditional upon the theory) as 
likelihoods, as Hartle and Srednicki seem to advocate,
then this simple theory would give unit likelihoods for 
all possible observations and hence presumably the high- 
est a posteriori probability. 

Should we then quit physics and say that we have the 
best possible theory of everything, namely the simple the- 
ory that everything possible exists? I would say that this 
is far too cheap an answer. 

If instead we include the normalization principle as I 
am advocating, then one would have to normalize the 
likelihoods of all the observations. If, in the theory that 
everything exists, one made the simple assignment that 
all of the infinite number of possible observations have 
equal likelihood, then their normalized value would be 
zero, and the resulting a posteriori probability for this 
theory would be zero, as indeed I would say it is. One 
could of course try to go to an improved theory in which 
although all possible observations exist, they have varying
likelihoods that are normalized. Then we are back
to the problem of assigning nontrivial likelihoods, which 
complicates the theory and reduces its a priori proba- 
bility. Thus we have the challenge of finding the best 
theory that neither is so complicated that it makes its 
a priori probability too small, nor has the normalized 
likelihoods spread so thinly that it makes the likelihood 
of our observation too small (e.g., by making us highly 
atypical).

One might try to go to the other extreme, maximizing 
the typicality of our observed data set by formulating 
the theory to predict that data set and only that data 
set, thereby giving it unit likelihood. However, it would 
be very surprising if any theory existed that predicted 
our observations uniquely and was fairly simple. Since 
such a theory is likely to be quite complicated, it would 
naturally be assigned a low a priori probability. Most 
scientists would presumably believe that even if one has 
to reduce the likelihood and typicality of our observed 
data from unity, one can gain far more in the a pri- 
ori probability for a simpler theory. In other words, it 
seems improbable that only our observed data exists or 
has nonzero probability, and much more probable that 
the correct theory predicts non-unit first-person observa- 
tional probability for our data. 

If we postulate that the first-person observational 
probabilities that a theory predicts are not true probabil- 
ities for an actual selection of our data from all possible 
data sets, but rather measures for the actual existence of 



the various data sets, then giving up on finding a simple 
theory predicting unit likelihood for our data set is equiv- 
alent to saying that other data sets actually do exist. In 
this way we are led to a many-observations theory. The 
many might be provided by a sufficiently large universe, 
by the many-worlds of the Everett version of quantum 
theory, and/or by a string landscape. It seems that the 
trade-off between a priori probabilities and likelihoods 
suggests that many different observed data sets exist, but 
not all possible observed data sets exist equally (i.e., with 
equal measures or equal likelihoods of being observed). 



V. EXAMPLE 

Let us take some set S of all possible data sets un- 
der consideration and consider theories to explain one 
of them. Suppose that the set of all possible data sets 
is countable (though logically it need not be, if for ex- 
ample they form a continuum). Imagine that there is a 
procedure for ordering them by their complexity, so that 
$D_1$ represents the simplest possible data set, and so on.
Then $D_j$ is the $j$th simplest data set. Let us assume that
our observed data set has $j \gg 1$, so that what we observe
is by itself not extremely simple. 

The theory, say $T_1$, that gives the maximum likelihood
for this data set would be the one that predicts that
it alone occurs uniquely, so that it has unit likelihood,
$P(D_j|T_1) = 1$. Assuming the background knowledge of
the set $S$ of all the possible data sets and their ordering by
complexity, theory $T_1$ could be specified simply by giving
the integer $j$. For most integers $j$ of similar value,
this information would not be compressible, so one could
say that the number of bits of information in this single-observation
theory is roughly $\log_2 j$.

Next, consider an alternative multi-observation theory
in which the likelihoods for the various possible data sets
come from some specific normalized probability distribution.
For simplicity and concreteness, consider theory
$T_N$, which gives the geometric distribution with mean
$N > 1$, so $P(D_j|T_N) = (N-1)^{j-1}/N^j$. If we wanted to
choose $N$ to maximize the likelihood $P(D_j|T_N)$, we would
need to choose $N = j$. However, this theory, $T_j$, would
have more information than $T_1$ (with the extra amount,
above that specifying $N = j$, saying that $T_j$ gives a geometric
distribution). Hence this $T_j$ theory would presumably
be assigned a lower a priori probability than
$T_1$. Furthermore, it would also give a lower likelihood,

$$P(D_j|T_j) = (j-1)^{j-1}/j^j \approx e^{-1}/j \ll 1 = P(D_j|T_1),$$

where the approximation and strong inequality occur
for $j \gg 1$, as I am assuming. Therefore, $T_j$ would give
a much lower a posteriori probability than $T_1$, showing
that multi-observation theories need not be better than
single-observation theories.
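
As a numerical check of the geometric-distribution likelihood, the following sketch searches over integer means N for a fixed j. The value j = 1000 and the search range are arbitrary illustrative choices, and the function name is hypothetical.

```python
import math

def geometric_likelihood(j, N):
    """P(D_j | T_N) = (N-1)**(j-1) / N**j for N > 1, computed via logarithms."""
    log_p = (j - 1) * math.log(N - 1) - j * math.log(N)
    return math.exp(log_p)

j = 1000  # arbitrary example rank of the observed data set
# Search integer means N from 2 up to (but not including) 5*j.
best_N = max(range(2, 5 * j), key=lambda N: geometric_likelihood(j, N))
peak = geometric_likelihood(j, best_N)
```

For this example the maximum is found at N = j, and the peak likelihood is close to 1/(e j), far below the unit likelihood of the single-observation theory.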

On the other hand, $T_N$ can be chosen to be much
simpler than $T_j$ (assuming the generic case in which $j$
is an incompressible large integer, with roughly $\log_2 j$
bits of incompressible information) by choosing $N$ to
be much simpler than $j$. However, to keep the likelihood
$P(D_j|T_N)$ from becoming too much smaller than
the maximum value $P(D_j|T_j) \approx e^{-1}/j$ for fixed $j$
and $N$ allowed to vary, $N$ should be chosen to be
roughly the same value as $j$, say within a factor of 2
of $j$. For example, if $N \approx j/2$, then
$P(D_j|T_N) \approx (2/j)e^{-2} \approx (2/e)P(D_j|T_j) \approx 0.736\,P(D_j|T_j) \approx 0.271/j$,
whereas if $N \approx 2j$, then
$P(D_j|T_N) \approx (0.5/j)e^{-0.5} \approx (\sqrt{e}/2)P(D_j|T_j) \approx 0.824\,P(D_j|T_j) \approx 0.303/j$.
Thus for $N$ within a factor of 2 of $j$, we always get
$P(D_j|T_N) > 1/(4j)$.
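
These bounds can be verified numerically. The sketch below assumes an arbitrary example value j = 10000 and scans every integer N from j/2 to 2j; the function name is hypothetical.

```python
import math

def geometric_likelihood(j, N):
    """P(D_j | T_N) = (N-1)**(j-1) / N**j for N > 1, computed via logarithms."""
    log_p = (j - 1) * math.log(N - 1) - j * math.log(N)
    return math.exp(log_p)

j = 10_000  # arbitrary example value with j >> 1
low = j * geometric_likelihood(j, j // 2)   # close to 2 / e**2
high = j * geometric_likelihood(j, 2 * j)   # close to 0.5 / sqrt(e)
# Smallest value of j * P(D_j|T_N) over all integers N within a factor of 2 of j.
worst = min(j * geometric_likelihood(j, N) for N in range(j // 2, 2 * j + 1))
```

In this example the smallest value occurs at the lower end, N = j/2, and stays above 1/4, in line with the bound quoted above.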

Now we can simply choose $N$ to be the nearest power
of 2 less than or equal to $j$, $N = 2^{[\log_2 j]}$, with the square
bracket denoting the integer part of the logarithm to base
2. Then in binary, $N$ has a 1 followed by $[\log_2 j]$ 0's,
the same as $j$ with all binary digits after the first truncated
to 0. Thus whereas specifying $j$ requires all $[\log_2 j]$
binary digits after the leading 1 to be specified, with
$[\log_2 j]$ bits of information, $N$ just requires a specification
of how many 0's it has after the leading 1, which
is just $[\log_2 [\log_2 j]]$ bits of information. For very large
generic (incompressible) $j$, $N$ thus has much less information
than that in $j$ itself.
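
The truncation to a power of 2 and the two bit counts can be sketched with integer bit operations. The function name and the example value of j are illustrative assumptions.

```python
def simplify_to_power_of_two(j):
    """Return (N, bits_for_j, bits_for_N) for an integer j >= 2.

    N          -- largest power of 2 not exceeding j
    bits_for_j -- binary digits of j after its leading 1, i.e. floor(log2 j)
    bits_for_N -- floor(log2(floor(log2 j))), the digits after the leading 1
                  of the count of 0's in N
    """
    if j < 2:
        raise ValueError("j must be at least 2")
    k = j.bit_length() - 1            # floor(log2 j)
    N = 1 << k                        # j with all digits after the first set to 0
    bits_for_N = k.bit_length() - 1   # floor(log2 k)
    return N, k, bits_for_N


j = 0b1011011101  # an arbitrary example value (733 in decimal)
N, bits_j, bits_N = simplify_to_power_of_two(j)
```

For this example N = 512, with 9 bits needed for j but only 3 for N, and N automatically lies within a factor of 2 of j.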

Therefore, if the gain in the a priori probability of $T_N$
from its relative simplicity of $N$, over that of the more
complex $T_1$, overcomes the decrease in the likelihood of
the observed data set $D_j$ from unity for $T_1$ to near $1/(4j)$
for $T_N$, then in Bayes' theorem, Eq. (1.1), the a posteriori
probability $P(T_N|D_j)$ for the multi-observation theory
$T_N$ with $N = 2^{[\log_2 j]}$ will exceed $P(T_1|D_j)$ for the
single-observation theory $T_1$. In particular, this would
be the case if the prior probabilities obey the inequality
$P(T_N) > 4j\,P(T_1)$.
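
The comparison of posteriors can be made concrete with a short sketch. The prior values used below are purely hypothetical, chosen only to show the two regimes, and the function name is an illustrative assumption.

```python
import math

def favors_multi_observation(j, prior_T1, prior_TN):
    """True if the posterior of T_N exceeds that of T_1, given data set D_j.

    T_1 predicts D_j alone (unit likelihood); T_N is the geometric theory
    with N = 2**floor(log2 j). Requires j >= 2.
    """
    N = 1 << (j.bit_length() - 1)
    like_TN = math.exp((j - 1) * math.log(N - 1) - j * math.log(N))
    like_T1 = 1.0
    # The common normalization in Bayes' theorem cancels in the comparison.
    return prior_TN * like_TN > prior_T1 * like_T1


j = 10**6
equal_priors = favors_multi_observation(j, prior_T1=1e-9, prior_TN=1e-9)
large_gain = favors_multi_observation(j, prior_T1=1e-9, prior_TN=4 * j * 1e-9)
```

With equal priors the single-observation theory wins, whereas a prior ratio of 4j in favor of the simpler multi-observation theory is enough to reverse the ordering.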

Suppose that one re-orders the $T_1$, $T_N$, and other possible
theories (assumed to be countable) into increasing order
of complexity and lets $I$ be the integer that gives this
new order, from 1 for the simplest theory, on up through
successively larger integers for more complex theories.
Then $T_1$ and $T_N$ for integers $N > 1$ will all have places
in this order, so that one will get the function $I(i)$ where
$i = 1$ for $T_1$ and $i = N$ for $T_N$. (Of course, the resulting
infinite countable set of values $I(i)$ will not exhaust the
positive integers, since there will also be another countably
infinite set of other theories whose $I$'s will partially
intertwine the $I(i)$'s.) $I$ will be roughly 2 to the power
of the number of bits needed to specify the theory, so
$I(1) \sim j$ and $I(N) \sim \log_2 j$ for $N = 2^{[\log_2 j]}$ and $j \gg 1$.

Now let us suppose that we take the a priori probabilities
$P(T_i) = p(I)$ to be a monotonically decreasing
function of the order of complexity $I$, so that simpler
theories are assigned higher prior probabilities and more
complex theories are assigned lower prior probabilities. If
the prior probabilities as a function of $I$, $p(I)$, fall off too
slowly with $I$, then one will get that the posterior probability
of the single-observation theory $T_1$ is greater than
that of the multi-observation theory $T_N$ for any $N$, such
as the simple choice $N = 2^{[\log_2 j]}$. However, if $p(I)$ falls
off sufficiently rapidly with $I$, then instead the simpler
multi-observation theory will be favored with the higher
a posteriori probability.

In the example above, it appears that it is sufficient
for $p(I)$ to fall off at least as rapidly as $I^{-s}$ for any
$s > 1$, or even as $I^{-1}(\ln I)^{-s}$ for any $s > 1$. For example,
consider $p(I) = (6/\pi^2) I^{-2}$, which would give
a normalized prior probability distribution for a countably
infinite set of theories ordered by complexity, with
$I = 1$ for the simplest, etc. With $I(1) \sim j$ and $I(N) \sim \log_2 j$,
the ratio of the priors is about $(j/\log_2 j)^2$ and the ratio
of the likelihoods is about $1/(4j)$. This set of priors would then
give the ratio of the posterior probability of the multi-observation
theory $T_N$ to that of the single-observation
theory $T_1$ as $P(T_N|D_j)/P(T_1|D_j) \sim j/[4(\log_2 j)^2] \gg 1$.
Thus with this choice of priors, the greater simplicity of
the multi-observation theory over the single-observation
theory would more than compensate for the reduced likelihood
it gives to the observed data set, so in the end the
multi-observation theory is favored.

Another simple set of prior probabilities that I have
advocated [H, 113, El is $p(I) = 2^{-I}$. Since this very
strongly favors simpler theories (each of which is twice
as probable a priori as the next simplest), in the example
above it gives a much higher posterior probability
to the multi-observation theory: $P(T_N|D_j)/P(T_1|D_j) \sim
2^j/(4j^2)$.
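The two posterior ratios for the power-law and exponential priors can be compared in a small sketch. It assumes the rough estimates $I(1) \sim j$ and $I(N) \sim \log_2 j$ and a likelihood ratio of about $1/(4j)$, all taken as order-of-magnitude inputs; the function names and the sample value of `j` are hypothetical. The computation is done in $\log_2$ because $2^j$ overflows ordinary floating point for large $j$.

```python
import math

def log2_posterior_ratio(j: int, log2_prior) -> float:
    """log2 of P(T_N|D_j) / P(T_1|D_j) under the assumed rough estimates."""
    i_single = float(j)            # assumed complexity rank of T_1, I(1) ~ j
    i_multi = math.log2(j)         # assumed complexity rank of T_N, I(N) ~ log2 j
    log2_likelihood_ratio = -math.log2(4 * j)   # likelihoods ~ 1/(4j) versus 1
    return log2_prior(i_multi) - log2_prior(i_single) + log2_likelihood_ratio

def log2_power_law(i: float) -> float:
    """log2 of the prior p(I) = (6/pi^2) * I**-2."""
    return math.log2(6 / math.pi**2) - 2 * math.log2(i)

def log2_exponential(i: float) -> float:
    """log2 of the prior p(I) = 2**-I."""
    return -i

j = 2**20                          # illustrative value only
power_law = log2_posterior_ratio(j, log2_power_law)
exponential = log2_posterior_ratio(j, log2_exponential)
```

A positive value means the multi-observation theory has the higher posterior probability under these assumptions; the exponential prior gives a far larger margin than the power-law prior.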

The situation is somewhat similar to theories of solip- 
sism versus theories in which other people are real. Solip- 
sism would give a higher likelihood for one's observations, 
but it is not nearly so simple as theories that other people 
are real. Therefore, when one chooses prior probabilities 
falling sufficiently rapidly with complexity (as humans 
apparently do implicitly without even consciously think- 
ing about it), in the end one favors theories in which 
other people are real. 

The example above shows that typicality itself, in the 
form of increased likelihoods, often is not sufficient to 
overcome the higher prior probabilities one might like to 
assign to simpler theories that may predict larger num- 
bers of possible observed data sets and correspondingly 
lower likelihoods for each. However, this does not work if 
the simpler theory predicts too large a range of observed 
data sets and hence makes the normalized likelihood of 
each one too small. In particular, theories that predict 
that all possible observations, out of an infinite set, occur 
with equal likelihood give zero likelihood for any partic- 
ular data set and hence have zero posterior probabilities 
(unless absolutely all of the prior probability is concen- 
trated upon such theories). 



VI. CONCLUSIONS 

We have seen that when one imposes the normalization 
principle and restricts to theories that each give likeli- 
hoods summing to unity for all possible data sets, typi- 
cality is automatically favored in the likelihoods. Since 
this preference comes directly from the normalized like- 
lihoods, it is not and need not be introduced "through 
a suitable choice of priors" as Hartle and Srednicki [1]
suggest. Instead, the prior probabilities for theories may
be chosen to "favor theories that are simple, beautiful, 
precisely formulable mathematically, economical in their 
assumptions, comprehensive, unifying, explanatory, ac- 
cessible to existing intuition, etc. etc.," as Hartle and 
Srednicki propose. 

The only sense in which I could be said to favor putting 
typicality into the priors would be the interpretation that 
imposing the normalization principle effectively assigns 
zero prior probabilities to theories in which the likeli- 
hoods of all possible observations do not sum to unity, 
as I would indeed do if that interpretation were forced 
upon me. But not imposing this requirement does not 
seem to me to make sense (and also leads to many sterile 
cosmological theories that cannot be tested against ob- 
servations). It seems rather analogous to not imposing 
the requirement of mathematical consistency. Therefore, 
I would argue that the normalization principle is a funda- 
mental principle of probability for multi-observation the- 
ories that need not be listed among the optional proper- 
ties Hartle and Srednicki have nicely enumerated for the 
theoretical prejudice of choosing the priors. 

Typicality by itself does not guarantee that the theory
with the highest posterior probability will make us typical.
However, typicality is favored in the likelihoods. One
need not impose it separately, but in discussions in which
one does not explicitly invoke the full Bayesian frame- 
work, assuming typicality may be a legitimate shortcut 
for selecting between different theories for our observa- 
tions. We are unlikely to be highly atypical. 